Introduction

The Healthy Brain Network data set contains data from 2577 participants who have each participated in a subset of 119 available assessments. The assessments range in both purpose and cost, and the goal of this analysis is to identify a set of assessments that is both inexpensive and reasonably predictive of learning disabilities. Value to the customer (both clinicians and patients) comes from alleviating the need for expensive, time-consuming assessments when they aren’t required.

library(tidyverse)
library(plotly)
# library(raster) # not attached in its entirety; called via "raster::"

Data Files

This analysis starts with a clean copy of the most recent release, HBN Release 7.0.

setwd(dir = "C:/Users/Mike/Dropbox/PhD/hbn-analysis/")
all_files <- dir("./hbn-data-rel-7-20191018/")
# One file describing "DailyMeds" seems particularly 
# problematic due to multiple EID columns.
files_to_ignore <- c("DailyMeds")
for(file in files_to_ignore){
  all_files <- all_files[!str_detect(all_files, file)]
}
# Extract test name abbreviations.
test_names <- all_files %>% 
  str_replace("9994_", "") %>% 
  str_replace("_20191018.csv", "")

Study Participant Identification

The file structure is such that each assessment is described by a unique file. Only the participants associated with each assessment will be listed in the assessment file. We’ll compile a master list of all participants by considering the unique EIDs across all assessments. The common key for all assessment files is the “EID” column. We’ll first validate the number of unique EIDs across all assessment files.

EID_list <- NULL
file_count <- 0
for(file in all_files){
  file_count <- file_count + 1
  # print(str_c("file ", file_count, " of ", length(all_files), " part 1"))
  test_list <- read_csv(str_c("./hbn-data-rel-7-20191018/", file)) %>% 
    select(EID) %>% 
    pull()
  EID_list <- c(EID_list, test_list)
}
EID_list <- EID_list %>% unique() %>% sort()
EID_list <- EID_list[grepl("^ND.*", EID_list)]
length(EID_list)
## [1] 2577
# 2577 unique EIDs.

Test Participation by Participant

We know there are 119 tests and 2577 unique participants, and we’d now like to get a feel for how many people have taken each test. Ultimately we want to find a subset of tests that a large portion of the study population have all completed. Eventually we could consider imputing small amounts of missing data, but let’s begin with a complete case approach. It is important to note that not all tests are appropriate for all participants (e.g., tailored to a specific sex or age group).

# Create the desired data frame structure.
EID_test_participation <- matrix(NA, nrow = length(EID_list), 
                                 ncol = length(all_files) + 1)
EID_test_participation <- as_tibble(EID_test_participation)
colnames(EID_test_participation) <- c("EID", test_names)
EID_test_participation$EID <- EID_list
# Populate the test participation data frame
file_count <- 0
for(file in all_files){
  file_count <- file_count + 1
  # print(str_c("file ", file_count, " of ", length(all_files), " part 2"))
  test_list <- read_csv(str_c("./hbn-data-rel-7-20191018/", file)) %>% 
    select(EID) %>% 
    pull()
  EID_test_participation[, file_count + 1] <- 
    EID_test_participation$EID %in% test_list
}

test_participation <- tibble(test = test_names,
                             num_participants = 
                               apply(EID_test_participation[,-1], 2, sum)) %>% 
  arrange(-num_participants) %>% 
  mutate(test = fct_reorder(test, rev(order(num_participants))))

Plot observed counts to display test participation (least to greatest).

plot1 <- ggplot(test_participation) + 
  geom_point(aes(x = test, y = num_participants)) +
  labs(title = "Number of Participants for Each HBN Assessment",
       x = "Assessment Title",
       y = "Number of Participants") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  coord_flip()
ggplotly(plot1)

Pairwise Co-Participation in Assessments

Let’s examine how many assessments each pair of participants has in common. This may give us a general idea of how many of the 119 assessments we might be able to use for a certain-sized subset of the study population.

EID_test_matrix <- EID_test_participation %>% 
  select(-EID) %>% 
  as.matrix()
# Sort the matrix rows from lowest to highest row sums.
EID_test_sum <- order(rowSums(EID_test_matrix))
EID_test_matrix <- EID_test_matrix[EID_test_sum, ]
EID_pairwise <- EID_test_matrix %*% t(EID_test_matrix)
# Reverse the columns.
EID_pairwise <- EID_pairwise[,length(EID_test_sum):1]
par(mar = c(0,0,4,0))
raster::plot(x = raster::raster(EID_pairwise), axes = FALSE, box = FALSE, xlab = "Participants", ylab = "Participants", main = "Pairwise Co-Participation in Tests\n(Participant Index vs. Reordered Participant Index;\nNumber of Tests)")

This figure should be somewhat concerning. Only the bottom-left quadrant (roughly one quarter of subject-subject pairs) shares ~50 or more of the 119 possible assessments. Moreover, pixels of similar color guarantee only that the corresponding pairs share the same number of tests, not the same set of tests. As an alternate view of the same distribution, we’ll construct a histogram of the pairwise co-participation counts.
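To see why equal counts need not mean equal test sets, here is a toy sketch with synthetic data (the participants A-D and the four tests are purely illustrative, not drawn from the HBN files):

```r
# Two participant pairs can share the same *number* of tests
# without sharing the same tests.
m <- rbind(A = c(TRUE, TRUE, FALSE, FALSE),
           B = c(TRUE, TRUE, FALSE, FALSE),
           C = c(FALSE, FALSE, TRUE, TRUE))
pairwise <- m %*% t(m)
pairwise["A", "B"]  # 2 shared tests (tests 1 and 2)
# Another pair also shares 2 tests, but a different set (tests 3 and 4):
m2 <- rbind(m, D = c(FALSE, FALSE, TRUE, TRUE))
(m2 %*% t(m2))["C", "D"]  # also 2
```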

max_shared_assessments <- max(EID_pairwise)
par(mar = c(5.1, 4.1, 4.1, 2.1))
hist(EID_pairwise, breaks = 0:max_shared_assessments, 
     main = "Pairwise Co-Participation in Tests", 
     xlab = "Number of Shared Tests")

Again, the large majority of pairs share fewer than 45-50 assessments. A good guess at this point is that we can find many hundreds of people who have all taken the same few dozen tests. There will no doubt be a tradeoff: increasing the sample size forces tests to be dropped for lack of sufficient participation.

Usable Assessments at Various Sample Sizes

We’d like to now find the set of assessments that we can use for various sample sizes. For now we’ll assume we want complete cases (i.e., no missing assessments for any participants). It is worth noting that this does not imply anything more than a subject’s EID number is on the roster for having taken the assessment. There could still be missing values or data entry errors.

# Sample Size Options
test_n <- seq(2500, 100, -100)
# How many people do we get?
n <- rep(0, length(test_n))
# How many assessments do we get?
p <- rep(0, length(test_n))
test_name_list <- list()

The heuristic below simply retains all tests whose overall participation exceeds \(n\), then keeps every participant who completed all of those tests. For each candidate \(n\) we record three things: the resulting study size, the number of tests completed by everyone in the subset, and the names of those tests.

test_counts <- apply(EID_test_matrix, 2, sum)
for(test_i in 1:length(test_n)){
  this_test_n <- test_n[test_i]
  top_n_tests <- which(test_counts > this_test_n)
  EID_test_matrix_pruned <- EID_test_matrix[,top_n_tests]
  # Mark non-participation (FALSE) as NA so that complete.cases()
  # below drops participants missing any of the retained tests.
  EID_test_matrix_pruned[!EID_test_matrix_pruned] <- NA
  EID_test_matrix_pruned <- EID_test_matrix_pruned[complete.cases(EID_test_matrix_pruned),]
  n[test_i] <- nrow(EID_test_matrix_pruned)
  p[test_i] <- ncol(EID_test_matrix_pruned)
  test_name_list[[test_i]] <- colnames(EID_test_matrix_pruned)
}

Let’s take a look at the relationship between \(n\) and \(p\) now that we have a few different-sized complete-case subsets.

ggplot() + 
  geom_point(aes(n, p)) + 
  labs(title = "Complete Tests vs. Study Size",
     y = "Number of Completed Tests", 
     x = "Study Population Completing All Tests")

Let’s consider the two scenarios with 471 and 1592 participants, in which all participants completed 52 and 28 assessments, respectively.

Scenario 1: 471 subjects, 52 assessments, assessment list

n[10] # Number of Subjects
## [1] 471
p[10] # Number of Common Assessments
## [1] 52
test_name_list[[10]] # All of these subjects completed all of these assessments.
##  [1] "APQ_SR"                     "ARI_P"                     
##  [3] "ARI_S"                      "ASSQ"                      
##  [5] "Barratt"                    "Basic_Demos"               
##  [7] "BIA"                        "C3SR"                      
##  [9] "CBCL"                       "CELF_5_Screen"             
## [11] "CIS_P"                      "ColorVision"               
## [13] "ConsensusDx"                "CTOPP_2"                   
## [15] "DTS"                        "EEG_TRACK"                 
## [17] "EHQ"                        "ESWAN"                     
## [19] "FitnessGram"                "FSQ"                       
## [21] "MFQ_P"                      "MRI_Track"                 
## [23] "NIH_Full"                   "NIH_Scores"                
## [25] "NIH_Scores_20191018_v2.csv" "NLES_P"                    
## [27] "PBQ"                        "PCIAT"                     
## [29] "Pegboard"                   "Physical"                  
## [31] "PPS"                        "PreInt_Demos_Fam"          
## [33] "PreInt_Demos_Home"          "PreInt_DevHx"              
## [35] "PreInt_EduHx"               "PreInt_FamHx_RDC"          
## [37] "PreInt_Lang"                "PreInt_TxHx"               
## [39] "PSI"                        "SAS"                       
## [41] "SCARED_P"                   "SCARED_SR"                 
## [43] "SCQ"                        "SDQ"                       
## [45] "SDS"                        "SRS"                       
## [47] "SWAN"                       "WHODAS_P"                  
## [49] "WIAT"                       "WISC"                      
## [51] "WISC_20191018_v2.csv"       "YFAS_C"

Scenario 2: 1592 subjects, 28 assessments, assessment list

n[5] # Number of Subjects
## [1] 1592
p[5] # Number of Common Assessments
## [1] 28
test_name_list[[5]] # All of these subjects completed all of these assessments.
##  [1] "APQ_SR"                     "ARI_P"                     
##  [3] "ARI_S"                      "ASSQ"                      
##  [5] "Barratt"                    "Basic_Demos"               
##  [7] "CBCL"                       "CELF_5_Screen"             
##  [9] "ColorVision"                "ConsensusDx"               
## [11] "CTOPP_2"                    "EEG_TRACK"                 
## [13] "EHQ"                        "FitnessGram"               
## [15] "NIH_Scores"                 "NIH_Scores_20191018_v2.csv"
## [17] "Pegboard"                   "Physical"                  
## [19] "PreInt_Demos_Fam"           "PreInt_Demos_Home"         
## [21] "PreInt_DevHx"               "PreInt_EduHx"              
## [23] "PreInt_TxHx"                "SCQ"                       
## [25] "SDQ"                        "SRS"                       
## [27] "SWAN"                       "WIAT"

Clearly there is a substantial reduction in sample size as the desired number of assessments increases. The challenge now will be to retain as many subjects and as many assessments as possible given a specific research question.

Forming a Hypothesis

We can take either of two approaches at this point: 1) throw in all the data and look for “something” or 2) look for something specific for which we think the data we have is relevant. Option 2 is more likely to produce a clinically relevant result. The next step should be to consider the sample sizes above (or others) and their respective assessment lists and develop an appropriate question given the available data.

Choosing a Target Assessment to Replace

From our initial meeting, a stated goal was to predict cognitive/language task performance using only the NIH Toolbox scores and other assessments outside the cognition/language category (excluding CBCL). From the image below, the targeted cognitive/language outcomes include:

  • Temporal Discounting
  • ACE
  • WISC-V
  • WAIS
  • WIAT
  • DAS
  • CELF-5
  • GFTA
  • CTOPP
  • TOWRE
  • EVT
  • PPVT
knitr::include_graphics("hbn-assessment-list.PNG")

Considering the list of assessments identified previously with 471 participants, the eligible cognitive/language assessment outcomes we can try to predict include:

  • WISC-V
  • WIAT
  • CELF-5
  • CTOPP

In the larger group of participants (n=1592), we’d have just over half as many eligible predictors to predict three of these assessments:

  • WIAT
  • CELF-5
  • CTOPP

In the image below, it appears that WISC and WIAT require the most time (60-75 minutes and 45-60 minutes, respectively) and present the greatest opportunity for savings (~1 hour for visits 1 and 3), so proceeding with 471 subjects and two time-consuming targets is our current recommendation. We can also optimize over these targets (WISC and WIAT) to grow the predictor set (other assessments) either as large as possible (requires no domain expertise) or perhaps with some intended predictors in mind. This effectively changes the question from, “Does any subset of X predict Y?” to “Do X1, X2, and X3 predict Y?” Choosing X1, X2, and X3 intelligently requires domain expertise. We may undesirably discard one or more of them due to low participation if we don’t have another reason to try to keep them.

knitr::include_graphics("hbn-time-commitment.PNG")
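The “fix the targets, then grow the predictor set” idea above can be sketched as follows, using a tiny synthetic stand-in for the EID_test_participation table built earlier (the four tests and participation pattern are illustrative only):

```r
library(dplyr)
# Synthetic stand-in for EID_test_participation (logical columns).
demo <- tibble(
  EID  = c("ND1", "ND2", "ND3", "ND4"),
  WISC = c(TRUE, TRUE, TRUE, FALSE),
  WIAT = c(TRUE, TRUE, FALSE, TRUE),
  SWAN = c(TRUE, FALSE, TRUE, TRUE),
  SCQ  = c(TRUE, TRUE, TRUE, TRUE)
)
# Keep only participants who completed both target assessments.
both <- demo %>% filter(WISC & WIAT)
# Rank the remaining assessments by participation within that subgroup;
# the best-covered ones are the natural candidate predictors.
colSums(select(both, -EID, -WISC, -WIAT))
```

In the real data we would apply the same filter to the full 2577 x 119 participation table and then trade off subgroup size against predictor coverage.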

Domain Expert Input

Several important insights emerged from a discussion with Lindsay about the assessments identified as candidates in the previous section. First, we should consider the difference in our two primary target assessments: the WIAT and the WISC.

The Wechsler Individual Achievement Test (WIAT) claims to be focused on achievement rather than intelligence. It measures academic capability, not necessarily potential. I had previously wondered about the relevance of some of the environmental questionnaires, but clearly a child’s environment can have tremendous impact on his or her academic achievement. This could include everything from parenting style to food security and fitness. Feeling safe at home and at school could also reasonably influence a child’s academic achievement.

The Wechsler Intelligence Scale for Children (WISC), however, seeks to assess intelligence and aptitude much like a traditional IQ test. We expect tests such as those contained in the NIH Toolbox to relate more directly to an intrinsic IQ measurement such as the WISC.

Michael provided some text excerpts identifying specific thresholds for certain learning disabilities.

Learning Disability Thresholds

A clinical definition of reading disability:

A standard score < 85 on the Test of Word Reading Efficiency (TOWRE), i.e., Total Word Reading Efficiency (TWRE) < 85. The TWRE is a composite of the Phoneme Decoding Efficiency (PDE) and Sight Word Efficiency (SWE) subtests (as such, children do not have to qualify for a DSM-5 diagnosis of Specific learning disorder with impairment in reading).

Alternatively, one could define a reading or math disability with this definition:

Specific learning disorder diagnoses were made by licensed psychologists based upon the reported educational history as well as results of the WISC-V, Wechsler Individual Achievement Test, Third Edition (WIAT-III), Test of Word Reading Efficiency, Second Edition (TOWRE-II), and relevant subsections of the Comprehensive Test of Phonological Processing, Second Edition (CTOPP-2). Given that these diagnoses were based on clinical judgment rather than a specific research cutoff, one concern is that they may be overly conservative relative to more commonly used research definitions for learning disability in the literature. To address this concern, we expanded our learning disabilities groups to additionally include any participant with a WIAT-Word Reading score < 85 for Specific Learning Disability-Reading (called the SLD-Reading group), and a WIAT Numerical Operations score < 85 for Specific Learning Disability-Math (called the SLD-Math group). Categorical regression analyses included these groups based on the above criteria.
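As a sketch, the expanded thresholds above could be operationalized as flags. The column names below (`WIAT_Word_Reading`, `WIAT_Num_Ops`, `TOWRE_TWRE`) and the scores are hypothetical and may not match the actual HBN variable names:

```r
library(dplyr)
# Hypothetical scores; real HBN column names will differ.
scores <- tibble(EID = c("NDARAA1", "NDARAA2"),
                 WIAT_Word_Reading = c(82, 101),
                 WIAT_Num_Ops      = c(95, 78),
                 TOWRE_TWRE        = c(84, 110))
scores <- scores %>%
  mutate(SLD_Reading = WIAT_Word_Reading < 85 | TOWRE_TWRE < 85,
         SLD_Math    = WIAT_Num_Ops < 85)
```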

Assessing the Costs of Assessments

We further discussed the cost of each assessment identified above. A goal of this research effort is to reduce the testing burden in terms of cost, primarily as measured in units of clinician time. The table below breaks down the testing performed according to the test format and test administrator requirements. Candidate solutions for alternate testing protocols should seek to minimize the clinician-administered tests as much as possible.

test_costs <- readr::read_csv("test_costs_and_format.csv")
DT::datatable(test_costs, filter = "top")
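As a sketch of the kind of cost summary we have in mind, assuming hypothetical columns `administrator` and `minutes` (the real headers in test_costs_and_format.csv may differ), we can total administration time by administrator type and target the clinician-administered block for reduction:

```r
library(dplyr)
# Synthetic stand-in for test_costs; real column names may differ.
test_costs_demo <- tibble(
  test          = c("WISC", "WIAT", "SWAN"),
  administrator = c("Clinician", "Clinician", "Parent"),
  minutes       = c(70, 50, 10)
)
time_by_admin <- test_costs_demo %>%
  group_by(administrator) %>%
  summarize(total_minutes = sum(minutes)) %>%
  arrange(desc(total_minutes))
time_by_admin
```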